Conversation
cherry-pick from opendatahub-io/vllm@6100f4b
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
njhill
approved these changes
May 8, 2024
Contributor
njhill
left a comment
Thanks @dtrifiro @tjohnson31415!
tdoublep
pushed a commit
that referenced
this pull request
Jan 20, 2025
This PR implements support for `batch size > 1` and tracks the warmup progress of multiple different `prompt-length/max-decode/batch-size` shapes.

### Contributions
- Introduce an env var and interpret `BATCH_SIZE` as a list of values (similar to `MIN_PAD_LENGTH` and `MAX_NEW_TOKENS`)
- Adapt the warmup loop to iterate over the zipped lists of **pad length**, **max new tokens**, and **batch size**
- Support a batch dimension for the input arguments (tokens, positions, masks) in the warmup algorithm
- Add batch-dimension support to the attention-mask update function (`update_mask()` in sendnn.py)
- Alter the test scripts to work with `batch size > 1`

#### The code has been tested in the following settings
- On **CPU**: `batch size = 4` and `batch size = 8` with `torch.compile(backend=inductor)`
- On **AIU**: `batch size = 1` in both **offline** and **online** mode

### Open questions (including unaddressed questions from [PR23](https://github.ibm.com/ai-foundation/vllm/pull/23))
- [x] Verify code functionality for `batch size = 4` and `batch size = 8` on **AIU**
- [ ] Ideally, the `SENDNNWorker` would check how many compiled shapes fit in AIU memory before starting to warm them all up. It is unclear how to decide this, and the implementation is missing.
- [ ] How should requests that are too long be handled? Right now they are just cut to the maximum padding length (we should probably fail the request and inform the client instead).
- [ ] Verify the output of the example prompts
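The env-var handling and zipped warmup loop described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the helper name `env_int_list`, the default values, and the loop body are assumptions; only the env var names (`BATCH_SIZE`, `MIN_PAD_LENGTH`, `MAX_NEW_TOKENS`) and the zip-over-shapes idea come from the description.

```python
import os

def env_int_list(name: str, default: str) -> list[int]:
    """Interpret an env var as a comma-separated list of ints.

    Hypothetical helper mirroring how the PR describes BATCH_SIZE,
    MIN_PAD_LENGTH, and MAX_NEW_TOKENS being parsed as lists of values.
    """
    return [int(v) for v in os.environ.get(name, default).split(",")]

pad_lengths = env_int_list("MIN_PAD_LENGTH", "64,128")
max_new_tokens = env_int_list("MAX_NEW_TOKENS", "20,20")
batch_sizes = env_int_list("BATCH_SIZE", "1,4")

# Warmup iterates over the zipped lists, compiling one shape per
# (pad length, max new tokens, batch size) tuple.
for pad_len, max_tokens, bs in zip(pad_lengths, max_new_tokens, batch_sizes):
    print(f"warming up shape: pad={pad_len} "
          f"max_new_tokens={max_tokens} batch={bs}")
```

In the real warmup, each tuple would drive a forward pass with batch-shaped tokens, positions, and masks so the compiler caches that shape before serving begins.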
Cherry-pick of fix commit 6100f4b from ODH:
opendatahub-io/vllm#17